Linear Regression and Pairs Trading

@ Jul 2017

MIT License

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.



In [1]:

    
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import statsmodels
import statsmodels.api as sm



In [2]:

    
tickers = ['PEP', 'KO']
data = get_pricing(symbols(tickers), start_date='2006-01-01', end_date='2008-08-01',
                  fields='close_price', frequency='daily')
data.head()









    Out[2]:






  
    
      
      Equity(5885 [PEP])
      Equity(4283 [KO])
    
  
  
    
      2006-01-03 00:00:00+00:00
      56.785
      38.273
    
    
      2006-01-04 00:00:00+00:00
      56.747
      38.208
    
    
      2006-01-05 00:00:00+00:00
      56.424
      38.404
    
    
      2006-01-06 00:00:00+00:00
      56.633
      38.732
    
    
      2006-01-09 00:00:00+00:00
      56.367
      38.938



In [3]:

    
data.columns









    Out[3]:





Index([Equity(5885 [PEP]), Equity(4283 [KO])], dtype='object')



In [4]:

    
data.columns = [ticker.symbol for ticker in data.columns]
data.index.name = 'Date'
data.head()









    Out[4]:






  
    
      
      PEP
      KO
    
    
      Date
      
      
    
  
  
    
      2006-01-03 00:00:00+00:00
      56.785
      38.273
    
    
      2006-01-04 00:00:00+00:00
      56.747
      38.208
    
    
      2006-01-05 00:00:00+00:00
      56.424
      38.404
    
    
      2006-01-06 00:00:00+00:00
      56.633
      38.732
    
    
      2006-01-09 00:00:00+00:00
      56.367
      38.938



In [5]:

    
cumm_rtn = (1 + data.pct_change()).cumprod()
cumm_rtn.plot();
plt.ylabel('Cummulative Returns');
plt.xlabel('Time');
plt.title('Cummulative Plot of Pepsi and Coca-Cola');

Visualise Price Relationship Between the Two Securities



In [6]:

    
colors = np.linspace(0.1, 1, len(data))
sc = plt.scatter(data[tickers[0]], data[tickers[1]], s=30,
                 c=colors, cmap=plt.get_cmap('jet'), edgecolor='k', alpha=0.7)
cb = plt.colorbar(sc)
cb.ax.set_yticklabels([str(p.date()) for p in data[::len(data)//9].index])
plt.xlabel(tickers[0])
plt.ylabel(tickers[1]);

Linear Regression in vector form

$$y = \beta x + \varepsilon$$

where,

\begin{equation*} y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \end{equation*}\begin{equation*} x = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix} \end{equation*}\begin{equation*} \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} \end{equation*}\begin{equation*} \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} \end{equation*}

Spread

$$ \begin{align*} \varepsilon & = y - \begin{bmatrix} \beta_0 & \beta_1 \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix} \\ \\ & = p^{KO} - \begin{bmatrix} \beta_0 & \beta_1 \end{bmatrix} \begin{bmatrix} 1 \\ p^{PEP} \end{bmatrix} \\ \\ & = p^{KO} - \beta_0 - \beta_1 p^{PEP} \end{align*} $$

where,

$p = Price$

Estimate with stats model

using statsmodel to estimate the parameters

KO is our dependent variable, and PEP is our independent variable



In [7]:

    
X = sm.add_constant(data[tickers[0]], prepend=False)
ols = sm.OLS(data[tickers[1]], X).fit()
beta = ols.params
y_fit = [X.min().dot(beta), X.max().dot(beta)]



In [8]:

    
print(ols.summary2())









    



                  Results: Ordinary least squares
==================================================================
Model:              OLS              Adj. R-squared:     0.919    
Dependent Variable: KO               AIC:                2837.4049
Date:               2017-07-14 09:02 BIC:                2846.3588
No. Observations:   650              Log-Likelihood:     -1416.7  
Df Model:           1                F-statistic:        7355.    
Df Residuals:       648              Prob (F-statistic): 0.00     
R-squared:          0.919            Scale:              4.5919   
-------------------------------------------------------------------
            Coef.    Std.Err.     t      P>|t|    [0.025    0.975] 
-------------------------------------------------------------------
PEP          1.2688    0.0148   85.7605  0.0000    1.2397    1.2978
const      -31.9220    0.9511  -33.5622  0.0000  -33.7896  -30.0543
------------------------------------------------------------------
Omnibus:              8.992         Durbin-Watson:           0.082
Prob(Omnibus):        0.011         Jarque-Bera (JB):        9.217
Skew:                 -0.289        Prob(JB):                0.010
Kurtosis:             2.926         Condition No.:           728  
==================================================================



In [9]:

    
colors = np.linspace(0.1, 1, len(data))
sc = plt.scatter(data[tickers[0]], data[tickers[1]], s=30,
                 c=colors, cmap=plt.get_cmap('jet'), edgecolor='k', alpha=0.7)
plt.plot([X.min()[0], X.max()[0]], y_fit, '--b', linewidth=3, label='Linear Regression Fit')
plt.legend()

cb = plt.colorbar(sc)
cb.ax.set_yticklabels([str(p.date()) for p in data[::len(data)//9].index])
plt.xlabel(tickers[0])
plt.ylabel(tickers[1]);

It is a perfect fit becasue this is in-sample

Linear Regression with scikit-learn



In [10]:

    
from sklearn.linear_model import LinearRegression



In [11]:

    
reg = LinearRegression(fit_intercept=True)
reg









    Out[11]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [12]:

    
reg.fit(data[tickers[0]].reshape(-1,1), data[tickers[1]])









    Out[12]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [13]:

    
print(reg.coef_)
print(reg.intercept_)









    



[ 1.26875149]
-31.9219750797

Modelling the Spread

$$\varepsilon = y - \beta_1 x - \beta_0$$



In [14]:

    
spread = pd.DataFrame(data[tickers[1]] - beta[0] * data[tickers[0]] - beta[1])



In [15]:

    
spread.columns = ['in-sample']
spread.head()









    Out[15]:






  
    
      
      in-sample
    
    
      Date
      
    
  
  
    
      2006-01-03 00:00:00+00:00
      -1.851078
    
    
      2006-01-04 00:00:00+00:00
      -1.867866
    
    
      2006-01-05 00:00:00+00:00
      -1.262059
    
    
      2006-01-06 00:00:00+00:00
      -1.199228
    
    
      2006-01-09 00:00:00+00:00
      -0.655740



In [16]:

    
spread.plot();
plt.xlabel('Time');
plt.ylabel('Spread');
plt.title('PEP-KO Spread');

Testing for Cointegration



In [17]:

    
adf = statsmodels.tsa.stattools.adfuller(spread['in-sample'], maxlag=1)
print('ADF test statistics: {:.03f}').format(adf[0])
print('p-value: {:.03f}').format(adf[1])









    



ADF test statistics: -3.411
p-value: 0.011

Trading Logic



In [18]:

    
spread.plot();
plt.axhline(spread['in-sample'].mean(), ls='--', color='b');
plt.axhline(spread['in-sample'].mean() + spread['in-sample'].std(), ls='--', color='y');
plt.axhline(spread['in-sample'].mean() - spread['in-sample'].std(), ls='--', color='y');

Out of Sample / Test Set



In [19]:

    
data_oos = get_pricing(symbols(tickers), start_date='2008-08-01', end_date='2010-01-01',
                      fields='close_price', frequency='daily')
data_oos.columns = [ticker.symbol for ticker in data_oos.columns]
data_oos.index.name = 'Date'



In [20]:

    
spread_oos = spread.reindex((spread.index).union(data_oos.index))



In [21]:

    
spread_oos['out-of-sample'] = data_oos[tickers[1]] - beta[0] * data_oos[tickers[0]] - beta[1]



In [22]:

    
spread_oos.head()









    Out[22]:






  
    
      
      in-sample
      out-of-sample
    
    
      Date
      
      
    
  
  
    
      2006-01-03 00:00:00+00:00
      -1.851078
      NaN
    
    
      2006-01-04 00:00:00+00:00
      -1.867866
      NaN
    
    
      2006-01-05 00:00:00+00:00
      -1.262059
      NaN
    
    
      2006-01-06 00:00:00+00:00
      -1.199228
      NaN
    
    
      2006-01-09 00:00:00+00:00
      -0.655740
      NaN



In [23]:

    
spread_oos.tail()









    Out[23]:






  
    
      
      in-sample
      out-of-sample
    
    
      Date
      
      
    
  
  
    
      2009-12-24 00:00:00+00:00
      NaN
      12.105009
    
    
      2009-12-28 00:00:00+00:00
      NaN
      11.762446
    
    
      2009-12-29 00:00:00+00:00
      NaN
      12.151259
    
    
      2009-12-30 00:00:00+00:00
      NaN
      11.838259
    
    
      2009-12-31 00:00:00+00:00
      NaN
      11.789947



In [24]:

    
spread_oos.plot();
plt.axhline(spread['in-sample'].mean(), ls='--', color='b');
plt.axhline(spread['in-sample'].mean() + spread['in-sample'].std(), ls='--', color='y');
plt.axhline(spread['in-sample'].mean() - spread['in-sample'].std(), ls='--', color='y');

What happened?

Lehman Brothers



In [25]:

    
data_all = data.append(data_oos)
colors = np.linspace(0.1, 1, len(data_all))
sc = plt.scatter(data_all[tickers[0]], data_all[tickers[1]], s=50, c=colors, 
                 cmap=plt.get_cmap('jet'), edgecolor='k', alpha=0.7, label='Price Data')
plt.plot([X.min()[0], X.max()[0]], y_fit, '--b', linewidth=3, label='OLS Fit')
plt.legend()
cb = plt.colorbar(sc)
cb.ax.set_yticklabels([str(p.date()) for p in data_all[::len(data_all)//9].index])
plt.xlabel(tickers[0])
plt.ylabel(tickers[1]);

Tear Sheet



In [26]:

    
bt = get_backtest('59687abcabb3315736a24c90')









    



100% Time: 0:00:01|###########################################################|



In [27]:

    
bt.create_full_tear_sheet()









    



Entire data start date: 2006-01-03
Entire data end date: 2008-08-01


Backtest Months: 30






    






  
    
      Performance statistics
      Backtest
    
  
  
    
      annual_return
      -0.02
    
    
      cum_returns_final
      -0.04
    
    
      annual_volatility
      0.07
    
    
      sharpe_ratio
      -0.22
    
    
      calmar_ratio
      -0.10
    
    
      stability_of_timeseries
      0.50
    
    
      max_drawdown
      -0.17
    
    
      omega_ratio
      0.96
    
    
      sortino_ratio
      -0.30
    
    
      skew
      -0.36
    
    
      kurtosis
      5.29
    
    
      tail_ratio
      0.91
    
    
      common_sense_ratio
      0.89
    
    
      gross_leverage
      0.70
    
    
      information_ratio
      -0.02
    
    
      alpha
      -0.01
    
    
      beta
      -0.03
    
  








    






  
    
      Worst drawdown periods
      net drawdown in %
      peak date
      valley date
      recovery date
      duration
    
  
  
    
      0
      17.41
      2006-10-18
      2007-10-30
      NaT
      NaN
    
    
      1
      3.18
      2006-04-12
      2006-09-07
      2006-10-18
      136
    
    
      2
      1.00
      2006-03-13
      2006-03-22
      2006-03-30
      14
    
    
      3
      0.78
      2006-02-02
      2006-02-06
      2006-02-09
      6
    
    
      4
      0.55
      2006-02-23
      2006-03-01
      2006-03-03
      7
    
  








    



[-0.009 -0.019]






    












    



/usr/local/lib/python2.7/dist-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)






    






  
    
      Stress Events
      mean
      min
      max
    
  
  
    
      Lehmann
      -1.22%
      -1.22%
      -1.22%
    
    
      Aug07
      0.04%
      -1.26%
      1.03%
    
    
      Mar08
      0.03%
      -0.47%
      0.61%
    
    
      Low Volatility Bull Market
      -0.02%
      -2.76%
      1.73%
    
    
      GFC Crash
      0.02%
      -2.06%
      1.99%
    
  








    



/usr/local/lib/python2.7/dist-packages/matplotlib/axes/_base.py:2787: UserWarning: Attempting to set identical left==right results
in singular transformations; automatically expanding.
left=733255.0, right=733255.0
  'left=%s, right=%s') % (left, right))






    












    






  
    
      Top 10 long positions of all time
      max
    
  
  
    
      PEP-5885
      70.14%
    
    
      KO-4283
      46.95%
    
  








    






  
    
      Top 10 short positions of all time
      max
    
  
  
    
      PEP-5885
      -104.29%
    
    
      KO-4283
      -70.88%
    
  








    






  
    
      Top 10 positions of all time
      max
    
  
  
    
      PEP-5885
      104.29%
    
    
      KO-4283
      70.88%
    
  








    






  
    
      All positions ever held
      max
    
  
  
    
      PEP-5885
      104.29%
    
    
      KO-4283
      70.88%



In [28]:

    
bt = get_backtest('59687c3e6d859b532e83fc13')









    



100% Time: 0:00:01|###########################################################|



In [29]:

    
bt.create_full_tear_sheet()









    



Entire data start date: 2008-08-01
Entire data end date: 2009-12-31


Backtest Months: 17






    






  
    
      Performance statistics
      Backtest
    
  
  
    
      annual_return
      -0.08
    
    
      cum_returns_final
      -0.11
    
    
      annual_volatility
      0.11
    
    
      sharpe_ratio
      -0.72
    
    
      calmar_ratio
      -0.52
    
    
      stability_of_timeseries
      0.25
    
    
      max_drawdown
      -0.15
    
    
      omega_ratio
      0.87
    
    
      sortino_ratio
      -0.92
    
    
      skew
      -0.70
    
    
      kurtosis
      4.15
    
    
      tail_ratio
      0.75
    
    
      common_sense_ratio
      0.69
    
    
      gross_leverage
      0.74
    
    
      information_ratio
      -0.01
    
    
      alpha
      -0.08
    
    
      beta
      -0.02
    
  








    






  
    
      Worst drawdown periods
      net drawdown in %
      peak date
      valley date
      recovery date
      duration
    
  
  
    
      0
      15.34
      2008-09-04
      2008-10-16
      NaT
      NaN
    
    
      1
      0.00
      2008-08-01
      2008-08-01
      2008-08-01
      1
    
    
      2
      0.00
      2008-08-01
      2008-08-01
      2008-08-01
      1
    
    
      3
      0.00
      2008-08-01
      2008-08-01
      2008-08-01
      1
    
    
      4
      0.00
      2008-08-01
      2008-08-01
      2008-08-01
      1
    
  








    



[-0.014 -0.03 ]






    












    






  
    
      Stress Events
      mean
      min
      max
    
  
  
    
      Lehmann
      -0.13%
      -1.63%
      1.03%
    
    
      Sept08
      -0.26%
      -1.63%
      1.03%
    
    
      2009Q1
      0.18%
      -1.45%
      2.33%
    
    
      2009Q2
      -0.03%
      -1.77%
      1.24%
    
    
      GFC Crash
      -0.03%
      -3.40%
      2.58%
    
    
      Recovery
      -0.04%
      -3.02%
      1.60%
    
  








    












    






  
    
      Top 10 long positions of all time
      max
    
  
  
    
      PEP-5885
      61.61%
    
    
      KO-4283
      49.24%
    
  








    






  
    
      Top 10 short positions of all time
      max
    
  
  
    
      PEP-5885
      -100.17%
    
    
      KO-4283
      -61.40%
    
  








    






  
    
      Top 10 positions of all time
      max
    
  
  
    
      PEP-5885
      100.17%
    
    
      KO-4283
      61.40%
    
  








    






  
    
      All positions ever held
      max
    
  
  
    
      PEP-5885
      100.17%
    
    
      KO-4283
      61.40%

	Equity(5885 [PEP])	Equity(4283 [KO])
2006-01-03 00:00:00+00:00	56.785	38.273
2006-01-04 00:00:00+00:00	56.747	38.208
2006-01-05 00:00:00+00:00	56.424	38.404
2006-01-06 00:00:00+00:00	56.633	38.732
2006-01-09 00:00:00+00:00	56.367	38.938

	PEP	KO
Date
2006-01-03 00:00:00+00:00	56.785	38.273
2006-01-04 00:00:00+00:00	56.747	38.208
2006-01-05 00:00:00+00:00	56.424	38.404
2006-01-06 00:00:00+00:00	56.633	38.732
2006-01-09 00:00:00+00:00	56.367	38.938

	in-sample
Date
2006-01-03 00:00:00+00:00	-1.851078
2006-01-04 00:00:00+00:00	-1.867866
2006-01-05 00:00:00+00:00	-1.262059
2006-01-06 00:00:00+00:00	-1.199228
2006-01-09 00:00:00+00:00	-0.655740

	in-sample	out-of-sample
Date
2009-12-24 00:00:00+00:00	NaN	12.105009
2009-12-28 00:00:00+00:00	NaN	11.762446
2009-12-29 00:00:00+00:00	NaN	12.151259
2009-12-30 00:00:00+00:00	NaN	11.838259
2009-12-31 00:00:00+00:00	NaN	11.789947

Performance statistics	Backtest
annual_return	-0.02
cum_returns_final	-0.04
annual_volatility	0.07
sharpe_ratio	-0.22
calmar_ratio	-0.10
stability_of_timeseries	0.50
max_drawdown	-0.17
omega_ratio	0.96
sortino_ratio	-0.30
skew	-0.36
kurtosis	5.29
tail_ratio	0.91
common_sense_ratio	0.89
gross_leverage	0.70
information_ratio	-0.02
alpha	-0.01
beta	-0.03

Worst drawdown periods	net drawdown in %	peak date	valley date	recovery date	duration
0	17.41	2006-10-18	2007-10-30	NaT	NaN
1	3.18	2006-04-12	2006-09-07	2006-10-18	136
2	1.00	2006-03-13	2006-03-22	2006-03-30	14
3	0.78	2006-02-02	2006-02-06	2006-02-09	6
4	0.55	2006-02-23	2006-03-01	2006-03-03	7

Stress Events	mean	min	max
Lehmann	-1.22%	-1.22%	-1.22%
Aug07	0.04%	-1.26%	1.03%
Mar08	0.03%	-0.47%	0.61%
Low Volatility Bull Market	-0.02%	-2.76%	1.73%
GFC Crash	0.02%	-2.06%	1.99%

Performance statistics	Backtest
annual_return	-0.08
cum_returns_final	-0.11
annual_volatility	0.11
sharpe_ratio	-0.72
calmar_ratio	-0.52
stability_of_timeseries	0.25
max_drawdown	-0.15
omega_ratio	0.87
sortino_ratio	-0.92
skew	-0.70
kurtosis	4.15
tail_ratio	0.75
common_sense_ratio	0.69
gross_leverage	0.74
information_ratio	-0.01
alpha	-0.08
beta	-0.02

Worst drawdown periods	net drawdown in %	peak date	valley date	recovery date	duration
0	15.34	2008-09-04	2008-10-16	NaT	NaN
1	0.00	2008-08-01	2008-08-01	2008-08-01	1
2	0.00	2008-08-01	2008-08-01	2008-08-01	1
3	0.00	2008-08-01	2008-08-01	2008-08-01	1
4	0.00	2008-08-01	2008-08-01	2008-08-01	1

Stress Events	mean	min	max
Lehmann	-0.13%	-1.63%	1.03%
Sept08	-0.26%	-1.63%	1.03%
2009Q1	0.18%	-1.45%	2.33%
2009Q2	-0.03%	-1.77%	1.24%
GFC Crash	-0.03%	-3.40%	2.58%
Recovery	-0.04%	-3.02%	1.60%